Strategies for Large-Scale Entity Resolution Based on Inverted Index Data Partitioning

Author

John R. Talburt

Abstract

Inverted indexing is a commonly used technique for improving the performance of entity resolution algorithms by reducing the number of pairwise comparisons needed to arrive at acceptable results. This chapter describes how inverted indexing can also be used as a data partitioning strategy for performing entity resolution on large datasets in a distributed processing environment. It discusses the importance of index-to-rule alignment, pre-resolution index closure, and post-resolution link closure, along with workflows for record-based and attribute-based identity capture and update.
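
As a rough illustration of the idea in the abstract, the following minimal sketch builds an inverted index over a handful of made-up records, compares only records that share an index key, and then reuses the same keys to route blocks to partitions. The records, the `index_keys` rule (last-name initial plus ZIP code), and the CRC-based routing are all assumptions chosen for the example; they are not taken from the chapter.

```python
from collections import defaultdict
from itertools import combinations
import zlib

# Hypothetical toy records: (record_id, name, zip_code)
RECORDS = [
    (1, "John Smith", "72201"),
    (2, "Jon Smith", "72201"),
    (3, "Mary Jones", "72205"),
    (4, "Maria Jones", "72205"),
    (5, "Ann Lee", "10001"),
]


def index_keys(record):
    """Hypothetical index key generator: last-name initial + ZIP code."""
    _, name, zip_code = record
    last_name = name.split()[-1]
    return [f"{last_name[0].upper()}{zip_code}"]


def build_inverted_index(records):
    """Map each index key to the set of record ids that produce it."""
    index = defaultdict(set)
    for record in records:
        for key in index_keys(record):
            index[key].add(record[0])
    return index


def candidate_pairs(index):
    """Only records sharing an index key are ever compared."""
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def assign_partitions(index, n_partitions):
    """Route each index block to one partition via a stable hash of its key."""
    partitions = defaultdict(list)
    for key in sorted(index):
        pid = zlib.crc32(key.encode("utf-8")) % n_partitions
        partitions[pid].append(key)
    return partitions


index = build_inverted_index(RECORDS)
pairs = candidate_pairs(index)
all_pairs = len(RECORDS) * (len(RECORDS) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
print(dict(assign_partitions(index, 2)))
```

Because every block is sent to exactly one partition, each worker can resolve its blocks independently. Records that generate several index keys would appear in more than one block, which is the situation the closure steps named in the abstract are meant to address; this sketch does not attempt to implement them.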

Similar resources

Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach

Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, howev...

Inverted index maintenance strategy for flashSSDs: Revitalization of in-place index update strategy

An inverted index is a core data structure of Information Retrieval systems, especially in search engines. Since the search environments have become more dynamic, many on-line index maintenance strategies have been proposed. Previous strategies were designed for HDDs. Consequently, in order to avoid expensive random access cost, Merge-based strategies have been preferred to In-place index updat...

Effect of Inverted Index Partitioning Schemes on Performance of Query Processing in Parallel Text Retrieval Systems

Shared-nothing, parallel text retrieval systems require an inverted index, representing a document collection, to be partitioned among a number of processors. In general, the index can be partitioned based on either the terms or documents in the collection, and the way the partitioning is done greatly affects the query processing performance of the parallel system. In this work, we investigate ...

Distributed Query Processing Using Partitioned Inverted Files

In this paper, we study query processing in a distributed text database. The novelty is a real distributed architecture implementation that offers concurrent query service. The distributed system adopts a network of workstations model and the client-server paradigm. The document collection is indexed with an inverted file. We adopt two distinct strategies of index partitioning in the distribute...

A Two-Tier Distributed Full-Text Indexing System

The performance of indexing systems is very important for a search engine. Usually, indexing systems on large-scale clusters can provide high search efficiency, but it brings expensive hardware costs. The costs would be greatly reduced if a distributed indexing system runs on small-scale clusters connected by the Internet. Two current inverted file partitioning schemes: document partitioning an...

Publication date: 2015
